20 Gaussian Process

#GaussianProcess #Stationary #Isotropic #RBF #MaternKernel #BrownianBridgeKernel

Recall bivariate normal distribution: $(Y_{1}, Y_{2}) \sim N_{2} (\vec{μ}, Σ)$ , where $\vec{μ} = (\begin{matrix} μ_{1} \\ μ_{2} \end{matrix}), Σ = (\begin{matrix} σ_{1}^{2} & ρ σ_{1} σ_{2} \\ ρ σ_{1} σ_{2} & σ_{2}^{2} \end{matrix}) .$
We are interested in the conditional distribution $Y_{2} | Y_{1} = a \sim N (μ_{2 | 1}, Σ_{2 | 1})$ . By the result from Lecture 19, $μ_{2 | 1} = μ_{2} + ρ \frac{σ_{2}}{σ_{1}} (a - μ_{1}), Σ_{2 | 1} = σ_{2}^{2} (1 - ρ^{2}) \leq σ_{2}^{2} .$
How can we visualize $Y_{1}, \dots, Y_{d}$ for $d > 2$ ?
A Gaussian Process generalizes this concept to random functions $\{Y (x) \in \mathbb{R}, x \in S \subset \mathbb{R}\}$ , where the index set $S$ can contain infinitely many points.
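
The conditional formulas for the bivariate normal recalled above can be checked numerically. The following is a minimal sketch; the function name `conditional_bivariate_normal` and the parameter values are illustrative assumptions, not part of the lecture.

```python
def conditional_bivariate_normal(mu1, mu2, sigma1, sigma2, rho, a):
    """Return (mean, variance) of Y2 | Y1 = a for a bivariate normal."""
    cond_mean = mu2 + rho * (sigma2 / sigma1) * (a - mu1)
    cond_var = sigma2 ** 2 * (1.0 - rho ** 2)
    return cond_mean, cond_var


# Illustrative numbers (assumed): mu = (0, 1), sigma = (1, 2), rho = 0.5, a = 2.
cond_mean, cond_var = conditional_bivariate_normal(
    mu1=0.0, mu2=1.0, sigma1=1.0, sigma2=2.0, rho=0.5, a=2.0
)
print(cond_mean, cond_var)  # 3.0 3.0
```

Here the conditional variance (3.0) is smaller than the marginal variance $σ_{2}^{2} = 4$, as the inequality above states.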

Stochastic Process

A Stochastic Process is a collection of random variables $\{Y (x), x \in S\}$ , where $S$ is the index set, like time.

Gaussian Process

A Gaussian Process (GP) is a stochastic process s.t. any finite collection $\{Y (x_{1}), \dots, Y (x_{n})\}$ is multivariate normal. We write $Y (\cdot) \sim GP (m (\cdot), k (\cdot, \cdot))$ , where $m$ is called the mean function and $k$ is called the covariance function (kernel): $E [Y (x)] = m (x), Cov (Y (x), Y (x^{'})) = k (x, x^{'}) .$
Moreover, define $\vec{m} = [\begin{matrix} m (x_{1}) \\ ⋮ \\ m (x_{n}) \end{matrix}], K = [\begin{matrix} k (x_{1}, x_{1}) & \dots & k (x_{1}, x_{n}) \\ ⋮ & ⋱ & ⋮ \\ k (x_{n}, x_{1}) & \dots & k (x_{n}, x_{n}) \end{matrix}] .$ Then ${(Y (x_{1}), \dots, Y (x_{n}))}^{T} \sim N_{n} (\vec{m}, K)$ .

Mean function: can be anything. Popular choices: $m (x) = 0, \forall x; m (x) = β^{T} x$ .
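Covariance function:
- Symmetric (required): $k (x, x^{'}) = k (x^{'}, x)$ .
- PSD (positive semi-definite, required): every matrix $K$ built from finitely many points must be positive semi-definite.
- Stationary (optional): $Cov (Y (x), Y (x^{'})) = k (x, x^{'}) = k (x - x^{'})$ , i.e. the covariance depends only on the difference $x - x^{'}$ .
- Isotropic (optional): $Cov (Y (x), Y (x^{'})) = k (\| x - x^{'} \|)$ , which is a stronger condition than stationarity. $\| \cdot \|$ is an arbitrary norm.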

Example: Radial Basis Function (RBF)

$K_{RBF} (x, x^{'}) = σ^{2} \exp \left\{ - \frac{1}{2 l^{2}} | x - x^{'} |^{2} \right\}$ . $l > 0$ is called the length scale.

The RBF kernel is positive definite.

A GP with this kernel is infinitely differentiable in the mean-square sense.
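
As a quick sanity check of the kernel properties listed earlier, the sketch below builds the matrix $K$ for the RBF kernel on a handful of points and verifies symmetry, positive definiteness, and stationarity numerically. The helper name `rbf_gram`, the points, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def rbf_gram(xs, sigma2=1.0, length=1.0):
    """Matrix K with entries sigma2 * exp(-(x_i - x_j)^2 / (2 * length^2))."""
    xs = np.asarray(xs, dtype=float)
    diff = xs[:, None] - xs[None, :]          # all pairwise differences
    return sigma2 * np.exp(-0.5 * diff ** 2 / length ** 2)


xs = np.array([0.0, 0.5, 1.5, 3.0])           # assumed example inputs
K = rbf_gram(xs, sigma2=2.0, length=1.0)

is_symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()         # > 0 means positive definite
K_shifted = rbf_gram(xs + 10.0, sigma2=2.0, length=1.0)
is_stationary = np.allclose(K, K_shifted)     # shifting all inputs leaves K unchanged

print(is_symmetric, min_eig > 0, is_stationary)
```

The diagonal of `K` equals $σ^{2}$, since $k (x, x) = σ^{2}$ for the RBF kernel.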

Sampling procedure:

Discretize $S$ as $\{x_{1}, \dots, x_{D}\}$ .

Sample $Y (x_{1}) \sim N (m (x_{1}), k (x_{1}, x_{1}))$ .

For $n = 1, \dots, D - 1$ , $(Y (x_{1}), \dots, Y (x_{n + 1})) \sim N_{n + 1} (\vec{m}, K)$ , where $\vec{m} = [\begin{matrix} m (x_{1}) \\ ⋮ \\ m (x_{n}) \\ m (x_{n + 1}) \end{matrix}], K = [\begin{matrix} k (x_{1}, x_{1}) & \dots & k (x_{1}, x_{n}) & k (x_{1}, x_{n + 1}) \\ ⋮ & ⋱ & ⋮ & ⋮ \\ k (x_{n}, x_{1}) & \dots & k (x_{n}, x_{n}) & k (x_{n}, x_{n + 1}) \\ k (x_{n + 1}, x_{1}) & \dots & k (x_{n + 1}, x_{n}) & k (x_{n + 1}, x_{n + 1}) \end{matrix}] .$
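Denote ${\vec{C}}_{n + 1} = [k (x_{1}, x_{n + 1}) \dots k (x_{n}, x_{n + 1})]^{T}$ , and let $K_{n}$ be the upper-left $n \times n$ block of $K$ , i.e. the covariance matrix of $Y (x_{1}), \dots, Y (x_{n})$ .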
Sample $Y (x_{n + 1})$ from $Y (x_{n + 1}) \mid Y (x_{1}), \dots, Y (x_{n}) \sim N (μ_{n + 1}, σ_{n + 1}^{2})$ , where
$$\begin{align*}
\mu_{n+1}&=m(x_{n+1})+\vec{C}_{n+1}^{\mathrm{T}}K_{n}^{-1}\begin{bmatrix}
Y(x_{1})-m(x_{1}) \\ \vdots \\ Y(x_{n})-m(x_{n})
\end{bmatrix},\\
\sigma_{n+1}^{2}&= k(x_{n+1},x_{n+1})-\vec{C}_{n+1}^{\mathrm{T}}K_{n}^{-1}\vec{C}_{n+1}.
\end{align*}$$
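
The sequential sampling procedure can be sketched in NumPy as follows. This is one possible implementation under stated assumptions: a zero mean function, the RBF kernel with $σ^{2} = 1, l = 1$, and a coarse grid of 11 points. The names `rbf`, `conditional_moments`, and `sample_gp` are hypothetical helpers, not from the lecture.

```python
import numpy as np


def rbf(x, x_prime, sigma2=1.0, length=1.0):
    """RBF kernel for scalar inputs."""
    return sigma2 * np.exp(-0.5 * (x - x_prime) ** 2 / length ** 2)


def conditional_moments(K, m, y_prev, n):
    """Mean and variance of Y(x_{n+1}) given the first n sampled values."""
    K_n = K[:n, :n]                  # upper-left n x n block
    c = K[:n, n]                     # the vector C_{n+1}
    w = np.linalg.solve(K_n, c)      # K_n^{-1} C_{n+1} without forming the inverse
    mu = m[n] + w @ (y_prev - m[:n])
    var = K[n, n] - w @ c
    return mu, var


def sample_gp(xs, mean_fn, kernel_fn, rng):
    """Draw one path on the grid xs, one point at a time."""
    xs = np.asarray(xs, dtype=float)
    m = np.array([mean_fn(x) for x in xs])
    K = np.array([[kernel_fn(a, b) for b in xs] for a in xs])
    y = np.empty(len(xs))
    y[0] = rng.normal(m[0], np.sqrt(K[0, 0]))
    for n in range(1, len(xs)):
        mu, var = conditional_moments(K, m, y[:n], n)
        # Clip tiny negative values caused by floating-point round-off.
        y[n] = rng.normal(mu, np.sqrt(max(var, 0.0)))
    return y


rng = np.random.default_rng(0)
xs = np.linspace(0.0, 5.0, 11)       # assumed grid
path = sample_gp(xs, mean_fn=lambda x: 0.0, kernel_fn=rbf, rng=rng)
print(path.shape)
```

Solving the linear system with `np.linalg.solve` instead of computing $K_{n}^{-1}$ explicitly is a design choice for numerical stability. On finer grids the solve can still become unstable.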

! $K_{n}$ can be ill conditioned, leading to numerical issues.
A solution: replace $K$ with $K + τ^{2} I$ , where $τ^{2} I$ is a constant diagonal matrix. This is similar to the regularization in Ridge regression.
Interpretation: noisy observation. Each $Y (x_{i})$ is observed with additive noise: $Y (x_{i}) = g (x_{i}) + ε_{i}$ , where $g$ is the underlying GP and $ε_{i} \overset{i . i . d .}{\sim} N (0, τ^{2})$ are independent of $g$ . The noise terms $ε_{i}$ are called the nugget.
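
To illustrate the effect of the nugget, the sketch below compares the condition number of an RBF covariance matrix on a dense grid before and after adding $τ^{2} I$. The grid size and the value $τ^{2} = 10^{-6}$ are assumptions picked for the demonstration.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 50)                 # dense grid (assumed)
diff = xs[:, None] - xs[None, :]
K = np.exp(-0.5 * diff ** 2)                   # RBF with sigma^2 = 1, l = 1

tau2 = 1e-6                                    # nugget variance (assumed value)
K_reg = K + tau2 * np.eye(len(xs))

cond_before = np.linalg.cond(K)
cond_after = np.linalg.cond(K_reg)
chol = np.linalg.cholesky(K_reg)               # factorization of the regularized matrix

print(f"{cond_before:.2e} -> {cond_after:.2e}")
```

The regularized matrix has all eigenvalues at least about $τ^{2}$, which bounds its condition number. The price is that the model now describes noisy rather than exact observations.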

Common kernels:

RBF Kernel: $K_{RBF} (x, x^{'}) = σ^{2} \exp \left\{ - \frac{1}{2 l^{2}} | x - x^{'} |^{2} \right\} .$
Matern Kernel: $ν = p + \frac{1}{2}$ where $p \in \mathbb{N}$ . $K_{p + 1 / 2}^{\text{Matern}} (x, x^{'}) = \frac{σ^{2} p!}{(2 p)!} \sum_{j = 0}^{p} \frac{(p + j)!}{j! (p - j)!} {\left[ \frac{2 \sqrt{2 p + 1}}{l} | x - x^{'} | \right]}^{p - j} \exp \left( - \frac{\sqrt{2 p + 1}}{l} | x - x^{'} | \right) .$ The summation is a polynomial in $| x - x^{'} |$ of degree $p$ . A GP with $K_{ν}^{\text{Matern}}$ is $⌈ ν ⌉ - 1 = p$ times differentiable in the "mean square" sense.
Brownian Motion (Non-stationary): $K_{BM} (x, x^{'}) = \min (x, x^{'}), x, x^{'} \in (0, \infty) .$
Brownian Bridge (Non-stationary): $K_{BB} (x, x^{'}) = \min (x, x^{'}) - x x^{'}, x, x^{'} \in [0, 1] .$
This corresponds to Brownian motion (which starts at $Y (0) = 0$ ) conditioned on $Y (1) = 0$ , so the process is pinned at both ends: $Y (0) = 0, Y (1) = 0$ .
Ornstein-Uhlenbeck: $K_{OU} (x, x^{'}) = \frac{σ^{2} l}{2} \exp \left\{ - \frac{1}{l} | x - x^{'} | \right\}, x, x^{'} \in (0, + \infty) .$ This model has applications in physics, biology, finance, etc.
Linear (Non-stationary): Bayesian Linear Regression. Suppose $Y (x_{i}) = β_{0} + β_{1} (x_{i} - c)$ , where $β_{0} \sim N (0, σ_{0}^{2}), β_{1} \sim N (0, σ_{1}^{2})$ and $β_{0} \perp\!\!\!\perp β_{1}$ . Then for any $x, x^{'}$ , the cross terms vanish by independence and $\begin{aligned} Cov (Y (x), Y (x^{'})) = & Cov (β_{0} + β_{1} (x - c), β_{0} + β_{1} (x^{'} - c)) \\ = & σ_{0}^{2} + σ_{1}^{2} (x - c) (x^{'} - c) . \end{aligned}$ So the corresponding kernel is $K_{linear} (x, x^{'}) = σ_{0}^{2} + σ_{1}^{2} (x - c) (x^{'} - c) .$
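
The kernels listed above can be written as plain Python functions. The sketch below is illustrative only: the function names and default hyperparameters are assumptions. It also checks numerically that conditioning Brownian motion on $Y (1) = 0$ via the Gaussian conditioning formula reproduces the Brownian bridge kernel.

```python
import math


def matern(x, x_prime, p, sigma2=1.0, length=1.0):
    """Matern kernel with nu = p + 1/2, following the half-integer formula."""
    scaled = math.sqrt(2 * p + 1) * abs(x - x_prime) / length
    poly = sum(
        math.factorial(p + j) / (math.factorial(j) * math.factorial(p - j))
        * (2.0 * scaled) ** (p - j)
        for j in range(p + 1)
    )
    return sigma2 * math.factorial(p) / math.factorial(2 * p) * poly * math.exp(-scaled)


def brownian_motion(x, x_prime):
    return min(x, x_prime)


def brownian_bridge(x, x_prime):
    return min(x, x_prime) - x * x_prime


def ornstein_uhlenbeck(x, x_prime, sigma2=1.0, length=1.0):
    return 0.5 * sigma2 * length * math.exp(-abs(x - x_prime) / length)


def linear(x, x_prime, sigma0_2=1.0, sigma1_2=1.0, c=0.0):
    return sigma0_2 + sigma1_2 * (x - c) * (x_prime - c)


def bm_given_endpoint(x, x_prime):
    """Covariance of Brownian motion conditioned on Y(1) = 0."""
    return (brownian_motion(x, x_prime)
            - brownian_motion(x, 1.0) * brownian_motion(1.0, x_prime)
            / brownian_motion(1.0, 1.0))


# Conditioning Brownian motion on its endpoint gives the bridge kernel.
print(abs(brownian_bridge(0.3, 0.8) - bm_given_endpoint(0.3, 0.8)) < 1e-12)  # True
```

For $p = 0$ the Matern formula reduces to $σ^{2} \exp (- | x - x^{'} | / l)$, which has the same functional form as the Ornstein-Uhlenbeck kernel up to the scale factor $l / 2$.
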
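Prediction:
Training data $\mathcal{D} = \{(x_{i}, y_{i}), i = 1, \dots, n\}$ . How can we predict $Y (x)$ for a new sample point $x$ ? We reuse the conditional step of the sampling procedure: treat the training inputs as $x_{1}, \dots, x_{n}$ with observed values $Y (x_{i}) = y_{i}$ , and the new point as $x_{n + 1} = x$ . Then $Y (x) \mid \mathcal{D} \sim N (μ_{n + 1}, σ_{n + 1}^{2})$ , where $μ_{n + 1}$ serves as the point prediction and $σ_{n + 1}^{2}$ quantifies its uncertainty.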
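
A minimal prediction sketch in NumPy is given below. It assumes a zero mean function, the RBF kernel, and a small nugget $τ^{2}$ added to the training covariance for stability. The helper names `rbf_matrix` and `gp_predict` and the toy data ($y_{i} = \sin (x_{i})$) are hypothetical choices for illustration.

```python
import numpy as np


def rbf_matrix(a, b, sigma2=1.0, length=1.0):
    """Matrix of RBF kernel values between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return sigma2 * np.exp(-0.5 * diff ** 2 / length ** 2)


def gp_predict(x_train, y_train, x_new, tau2=1e-6, sigma2=1.0, length=1.0):
    """Predictive mean and variance of Y(x_new) given the training data (zero mean assumed)."""
    K = rbf_matrix(x_train, x_train, sigma2, length) + tau2 * np.eye(len(x_train))
    C = rbf_matrix(x_train, x_new, sigma2, length)      # one column per new point
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    mean = C.T @ alpha
    V = np.linalg.solve(L, C)
    var = sigma2 - np.sum(V ** 2, axis=0)               # k(x, x) - C^T K^{-1} C
    return mean, var


x_train = np.array([0.0, 1.0, 2.0, 3.0])                # toy data (assumed)
y_train = np.sin(x_train)
x_new = np.array([0.5, 1.5, 100.0])
mean, var = gp_predict(x_train, y_train, x_new)
print(mean.shape, var.shape)
```

Near the training inputs the predictive variance is small, while far away (here $x = 100$) the prediction reverts to the prior mean 0 with variance close to $σ^{2}$.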